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Abstract 

The hidden Markov model (HMM) is a generative model that treats sequential data 
under the assumption that each observation is conditioned on the state of a discrete 
hidden variable that evolves in time as a Markov chain. In this paper, we derive 
a novel algorithm to cluster HMMs through their probability distributions. We 
propose a hierarchical EM algorithm that i) clusters a given collection of HMMs 
into groups of HMMs that are similar, in terms of the distributions they represent, 
and ii) characterizes each group by a "cluster center", i.e., a novel HMM that is 
representative for the group. We present several empirical studies that illustrate 
the benefits of the proposed algorithm. 



1 Introduction 

The hidden Markov model (HMM) fT^ is a probabilistic model that assumes a signal is generated 
by a double embedded stochastic process. A hidden state process, which evolves over discrete time 
instants as a Markov chain, encodes the dynamics of the signal, and an observation process, at 
each time conditioned on the current state, encodes the appearance of the signal. HMMs have been 
successfully applied to a variety of applications, including speech recognition O], music analysis ||2l, 
on-line hand-writing recognition 131, analysis of biological sequences 14j|. 

The focus of this paper is an algorithm for clustering HMMs. More precisely, we design an algorithm 
that, given a collection of HMMs, partitions them into K clusters of "similar" HMMs, while also 
learning a representative HMM "cluster center" that summarizes each cluster appropriately. This is 

1http://acsweb.ucsd.edu/~ecoviell7| 
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similar to standard k-means clustering, except that the data points are HMMs now instead of vectors 
in R''. Such HMM clustering algorithm has various potential applications, ranging from hierarchical 
clustering of sequential data (e.g., speech or motion sequences), over hierarchical indexing for fast 
retrieval, to reducing the computational complexity of estimating mixtures of HMMs from large 
datasets via hierarchical modeling (e.g., to learn semantic annotation models for music and video). 

One possible approach is to group HMMs in parameter space. However, as HMM parameters lie on 
a non-linear manifold, they cannot be clustered by a simple application of the k-means algorithm, 
which assumes real vectors in a Euclidean space. One solution, proposed in |5 |, first constructs an 
appropriate similarity matrix between all HMMs that are to be clustered (e.g., based on the Bhat- 
tacharya affinity, which depends on the HMM parameters |6|), and then applies spectral clustering. 
While this approach has proven successful to group HMMs into similar clusters |5 1, it does not allow 
to generate novel HMMs as cluster centers. Instead, one is limited to representing each cluster by 
one of the given HMMs, e.g., the HMM which the spectral clustering procedure maps the closest to 
each spectral clustering center. This may be suboptimal for various applications of HMM clustering. 

An alternative to clustering the HMMs in parameter space is to cluster them directly with respect to 
the probability distributions they represent. To cluster Gaussian probability distributions, Vascon- 
celos and Lipmann [7| proposed a hierarchical expectation-maximization (HEM) algorithm. This 
algorithm starts from a Gaussian mixture model (GMM) with K^''^ components and reduces it to 
another GMM with fewer components, where each of the mixture components of the reduced GMM 
represents, i.e., clusters, a group of the original Gaussian mixture components. More recently, Chan 
et al. |8 1 derived an HEM algorithm to cluster dynamic texture (DT) models (i.e., linear dynamical 
systems, LDSs) through their probability distributions. HEM has been applied successfully to con- 
struct GMM hierarchies for efficient image indexing [9jj, to cluster video represented by DTs 1 10|, 
and to estimate GMMs or DT mixtures (DTMs,i.e., LDS mixtures) from large datasets for semantic 
annotation of images fm . video fllOl and music lfT2l[T3l . 

To extend the HEM framework for GMMs to mixtures of HMMs (H3Ms), additional marginaliza- 
tion of the hidden-state processes is required, as for DTMs. However, while Gaussians and DTs 
allow tractable inference in the E-step of HEM, this is no longer the case for HMMs. Therefore, 
in this work, we derive a variational formulation of the HEM algorithm (VHEM) to cluster HMMs 
through their probability distributions, based on a variational approximation derived by Hershey 
114 1 . The resulting algorithm not only allows to cluster HMMs, it also learns novel HMMs that 
are representative centers of each cluster, in a way that is consistent with the underlying generative 
probabilistic model of the HMM. The resulting VHEM algorithm can be generalized to handle other 
classes of graphical models, for which standard HEM would otherwise be intractable, by leverag- 
ing similar variational approximations. The efficacy of the VHEM-H3M algorithm is demonstrated 
for various applications, including hierarchical motion clustering, semantic music annotation, and 
online hand-writing recognition. 

2 The hidden Markov model 

A hidden Markov model (HMM) M. assumes a sequence of r observations yi-r is generated by a 
double embedded stochastic process, where each observation yt at time t depends on the state of 
a discrete hidden variable xt and where the sequence of hidden states xi-t- evolves as a first order 
Markov process. The discrete variables can take one of N values, and the evolution of the hidden 
process is encoded in a transition matrix A — [a^ .y]^ .^=1 ... jv whose entries are the state transition 
probabilities ap,^ — P{xt+i = ^\xt — (3), and an initial state distribution tt = [tti, . . . , ttn], where 
TTp = P{xi — (3). Each state generates observation accordingly to an emission probability density 
function, p{yt\xt — (3, A4), which here we assume to be a Gaussian mixture model: 

M 

p{y\x^f3) ^ c/3,™A/'(y;Ai^,m,S^,m) (1) 

m— 1 

where M is the number of Gaussian components and cp^m are the mixing weights. In the fol- 
lowing, when referring to a sequences of length r, we will use the notation tTxi-^P{xi:t) = 
^^1 Y[t=2 '^xt-i,xt to represent the probability that the HMM generates the state sequence xi-r- 
The HMM is specified by the parameters M. — {tt, A, c^,,„, M0,m, S/3,m} which can be efficiently 
learned with the forward-backward algorithm IT], which is based on maximum likelihood. 
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A hidden Markov mixture model (H3M) [ 15 1 models a set observation sequences as samples from a 
group of K hidden Markov models, which represent different sub-behaviors. For a given sequence, 
an assignment variable z ^ multinomial(a;i , • ■ • ujk) selects the parameters of one of the K HMMs, 
where the k — th HMM is selected with probability ujk- Each mixture component is parametrized by 
Mz = {tt^, A^,c^ ,„,/i| ,„,S| „j} and the H3M is parametrized by 7W = {uJz,Mz}f=i- Given 

a collection S = {yl-r, ■ ■ ■ , of relevant observation sequences, the parameters of Ai can be 
learned with recourse to the EM algorithm 1 15] . 

To reduce clutter, here we assume that all the HMMs have the same number of states N and that all 
emission probabilities have M mixture components, though a more general case could be derived. 

3 Variational hierarchical EM algorithm for H3Ms 

The hierarchical expectation maximization algorithm (HEM) |7| was initially proposed to cluster 
Gaussian distributions, by reducing a GMM with a large number of components to a new GMM 
with fewer components, and then extended to dynamic texture models [8|. In this section we derive 
a variational formulation of the HEM algorithm (VHEM) to cluster HMMs. 

3.1 Formulation 

Let A^(^) be a base hidden Markov mixture model with K'-''^ components. The goal of the VHEM 
algorithm is to find a reduced mixture ^A'■■^^ with if''") < K'^''^ components that represent 
The likelihood of a random sequence yi-r ~ is given by 

p{yi..r\M^'^) = E Wl:r l^(') = A^^'^), (2) 
i=l 

where z^^^ ~ multinomial(wJ^^ , • • • w^jf,, ) is the hidden variable that indexes the mixture compo- 
nents. p(j/i:t-|z = «, TW'''^) is the likelihood of yi-r under the ith mixture component, and is the 
prior weight for the ith component. The likelihood of the random sequence yi-^- ^ A^*^*") is 

p(2/i:.|A^W) ^ ^^^^(y.^^l^W ^^^-^A^M), (3) 

where z^""^ ~ multinomial(a;J'^\ ■ • ■ ,a;^(^, ) is the hidden variable for indexing components in 
>(('■). Note that we will always use i and j to index the components of the base model, A^*^'''', 
and the reduced model, J^^'^\ respectively. In addition, we will always use /? and 7 to index the 
hidden states of Aif^ and p and a for Al^'''- To reduce clutter we will denote piyi-.rlz^'^^ = 
ijAJ'-''') = p{yi;r\Mf^), and Ey^^|^(6)^(b)^j[-] = E^(b)[-]. In addition, we will use short-hands 

(h) (r) 

^Al and Mi when conditioning over specific state sequences. For example, we denote 
p{yi:r\xi:r = Pi-.r^M^^) = p{yi..r\Mfl^J, and E^^^^ [■] = E^(6)^ [•]. Finally, 
we will use m and £ for indexing the gaussian mixture components of the emission probabilities of 
the base respectively reduced mixture, which we will denote as A^^''|„' and Alp 

3.2 Parameter estimation - a variational formulation 

To obtain the reduced model, we consider a set of N virtual samples drawn from the base model 
A^^*"', such that Ni = Nlo^'''^ samples are drawn from the ith component. We denote the set of Ni 
virtual samples for the zth component as = where j/^'^'"'' ^ A^^-'''', and the entire set 

of N samples as y = . Note that, in this formulation, we are not generating virtual samples 

{x^i.'^\ y^.':^^ } according to the joint distribution of the base component, p(xi;r , yi-.rlM^^ ) . The 
reason is that the hidden state spaces of each base mixture component Adf ^ may have a different 
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representation, (e.g., , the numbering of the hidden states may be permuted between the compo- 

(r) 

nents). This basis mismatch will cause problems when the parameters of A4 j are computed from 

virtual samples of the hidden states of {A^.^''''}. Instead, we must treat = {x'^i.'™^} as "missing" 
information, as in the standard EM formulation. 

The likelihood of the virtual samples is 

^(A^W) = logp(y|A^W) ^ J2 logp(^^l>(^"^) = E log E ^J"^P(Y^\M^i^)■ (4) 

i=l i=l j = l 

In particular, for a given A4'^^\ the computation of \ogp{Yi\M.'-^'^) can be carried out solving the 
optimization problems |.16.,17 |: 

iogp(y,|xW) 



max \ogp{Y,\M^^'^^) ~ D{V^{z,)\\P{z, ^ jlY^M^^'^)) 



max VT'.I.z^ = i) logJp + \ogp{Y,\M^^^) - log7'.(z. = j) 

3 



(5) 
(6) 



for 2 = 1,..., K^^\ where Vi{zi) are variational distributions and D{p\\q) — J p{y) log ^^dy 
is the Kullback-Leibler (KL) divergence between two distributions, p and q. In order to obtain a 
consistent clustering fT\, we assume the whole sample Yi is assigned to the same component of the 

= 1, Vi and Zij > Q\fi,j, and ^ becomes: 



reduced model, i.e., Viizi = j) = Zij, with 



logp(r,|X('-)) = max^z,Jlogwj'')+logp(K,|A^fVlogz, 



(7) 



Considering that virtual samples Yi are independent for different values of i, we can solve (|6]l inde- 
pendently for each i, using the result in Section |6^ and find 



(8) 



uj'f '' exp{7ViE_^(b) 


"logp(r,|A^fV 


} 




"iogp(y,iA^;:))' 


} 



For the likelihood of the virtual samples p{Yi\M''^'') we use 



Ni 



logp{Y,\M^;^) = 5]logp(yj:f)|A^(''))=7V, 

m— 1 

'\ogp{y,..r\M^;^) 



±J2^ogpiytr^\Ml^^] 



(9) 
(10) 



where ( fTOb follows from the law of large numbers Q (as Ni oo). Substituting JTOl i in ^ we get 
the formula for Zij derived in 



(11) 



wj''^ exp{7VjE_^(i,) 


\ogp{y^..^M'^p) 


} 


Er^l 4^ exp{iV.E^<., 


\ogp{yi..r\M^p) 


} 



We follow a similar approach to compute the expected log-likelihoods E^(6) \ogp{yi-r\M''p) 



We introduce variational distributions P^'^"* to approximate P{xi;r\yi:Ti M.^) for observations 



E 



2/1:7- ~ Ai\ emitted by state sequence Pi-r, and solve the maximization problem 

\\ogp{y,..r\M'^p) 



max 
V 



maxV 7ri^)^*E 

-oi,7 -^-^ ^1-^ 



(r) 



'^'^h-.A^l--r = Pl:T)l0g 



^^'/,,(a;i:r = Pl:r) 



(12) 
(13) 

(14) 
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where in (fTsT i we have used the law of total probability to condition the expectation over each state 
sequence f3i:r of M-f^ 

In general, maximizing ( fT4] l exactly sets the variational distribution to the true posterior and reduces 
(together with ( fTTT l) to the E-step of the HEM algorithm for hidden state models derived in ifTOl : 



E 



l0gp{xi;r,yi:T\M 



n (15) 



" (r) (r) 

where the inner expectation is taken with respect to the current estimate A4 j of Ai^ , and H is 



some term that does not depend on A4 



(r) 



3.3 The variational HEM for HMMs 

The maximization of ( fl4] i cannot be carried out in a efficient way, as it involves computing the 
expected log-likelihood of a mixture. To make it tractable we follow a variational approximation 
proposed by Hershey (14^, and restrict the maximization to factored distribution in the form of a 
Markov chain, i.e., 

r 

t=2 

where 01 = 1 V/3i and '/'J'^ (pt-i, Pt, A) = 1 V/3t,pt_i. 

Substituting ( fT6l l into (fT4b we get a lower bound to the expected log-likelihood ( fT4] l. i.e., 

\ogp{y^.,r\M\H > r-'{M^;\(t>''n V0''^' (17) 



3 

where we have defined 

Using the property of HMM with memory one that observations at different time instants are inde- 
pendent given the corresponding states, we can break the expectation term in equation (fTsT l in the 
following summation 

E^<. [logp(yi.|A^5:j^J] = ^i(A^Sj|>(M) (19) 

where L(A^f^^ 1 1 A^j^j^ ) = E^C') [logp(yt l-^j^jt )]■ ^.s the emission probabilities are GMMs, the 
computation ( fT9] l cannot be carried out efficiently. Hence, we use a variational approximation ifTSl . 
and introducte variational parameters 'ni\m ''''''''^ ^' ~ li • ■ • : with X^fci '^llm^'*'"''''^ ^ 
1 Vm, and ?7^|^f > OV^,m. Intuitively, ri^^^f^^'^^^p) is the responsibility matrix between gaus- 
sian observation components for state /3 in Mf^ and state p in M'^p, where fle\'Jri'^'''^'' means the 

(b) (r) 

probability that an observation from component m of Ai^ ^ corresponds to component £ of Aij ^. 
Again, we obtain a lower bound: 

L{M[%\\M^:l) ^ ^(-^illl-^ii) Vr,(-/^)'(^-'') (20) 



where we have defined: 

M M 

" " " (21) 



m=l 1=1 



where II A^p'^]'"') = E^|^(b),i[logP(y|7Wp'^]'-')] can be computed exactly for Gaussians 



i„(E<:r'E5) (22, 



-i(,<3-,SfES-'(.S-.S). (23) 
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Plugging into (fT9] l and ( fTsT i we get the lower bound to the expected log-likelihood: 

'\ogp{yi..r\M^;^)] > J■'^J■(XW,0*■^^7) V0■''^r; (24) 



where we have defined: 



Pl-.T Pl-T t=l 

Maximizing the right hand side of (|24] | with respect to (p rj finds the most accurate approximation 
to the real posterior within the restricted class, i.e., the one that achieves the tightest lower bound. 

Finally, plugging (l24l i into dUi, we obtain a lower bound on the log-likelihood of the virtual sample: 

JiM^"-^) > J{M^'^\z,cj>,'n) yz,(P,rj (26) 

where we have defined: 

1=1 j=i 1=1 j=i 



if (b) ^(r) g g 



i=l 3 = 1 [i=l p=l 

+ E E E E E ^(-^Sji-^Sj 

1=1 j = l ,3 = 1 p=l t=l 

- EE-^.^E-^^rEC.i...iogC.i... (27) 

i=l j = l /3 = 1 p=l 

To find the tightest possible lower bound to the log-likelihood of the virtual sample we need to solve 

> max J(XM,2,</),r;). (28) 

Starting from an initial guess for A4''^\ the parameters can be estimated by maximizing (l28T l irtera- 
tively with respect to (E-step) rj, (j), z and (M-step) M. '^''^ 

3.4 E-step 

The E-steps first considers the maximization of ( [28] l with respect to r) for fixed JVl^^^ z and 0, i.e., 

j7 = argmaxJ(A^W,2,</),r/) (29) 

It can be easily verified that the maximization ( |29] l does not depend on z and can be carried out 
independently for each tuple (i, j, /3, p, m) using the result in Section l6^ l 181. which gives: 

"" " e;1,cS;'=p{l„(>.™,„'ii>,<;;/)} 

and that the terms in (l2Tl i can then be computed for each (i, j, /3, p) as: 

M M 

L{M^],\\M^;l) = E 4'l^log^c(J^exp{LG(A^£:i|A^S'^)} . (31) 

m=l t=l 

Next, ( |28] l is maximized with respect to 4> for fixed (TW^*"^ z and r;), i.e., 

<f) — argmax 2,0,77) (32) 
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The maximization does not depends on z and can be carried out independently for each pair (i, f) 
with a backward recursion recursion lfT4l that computes 

</>'^U_i,p«,A) = — ^— . \ (33) 

a^l, exp I lA^S) + p)} 

AT AT 

= 5]a?):;^,log5^aW'^;^exp{L(Al(3l|A^W) + £j:^,(/^^ (34) 
p=i 

for T = T, . . . , 2, where it is understood that £^':^]^(/3f , pt) =0, and terminates with 

vrM^^- exp {L(A^f )J|A^W ) + 4'-'-(/3i,pi)} 

•/- ''(Pi,/?!) = , , - \ V (35) 

4^^" exp {l(A^^3^ I + 4- (/3i, p) } 

r'^{M^;\4'"\v) = ^4''^'Mog5]^W^^exp{L(A^(3l|Al(.:))+/:^^^(/3,p)}. (36) 

13=1 p=l 

Next, the maximization of (|28] | with respect to 2: for fixed TW'''^ </) and ry, i.e., 

£ = argmaxJ(7W(''),2,</),r7) (37) 

z 

reduces to compute the Zij as in (fTTT i using ( [36] l to approximate (fTOl i. 
Finally, we compute the following summary statistics: 

z.i'^(a,7) = 7rW-^<^i'^(a,7) (38) 

N 



iP, ^, 7) = «?,7 J (P, a, 7) for i = 2, . . . , r (39) 

AT 

'(a, 7) = ^et'^(p,a,7)fort = 2,...,T (40) 



p=i 



and the aggregates 



Af 



i^\:'i<^) =E'^i''(^'7) (41) 



7=1 



^>^'-''(a,7)=E^*'-''(^'T) (42) 



e^"(p,a)=^^et''^"(p,a,7). (43) 

t=2 7=1 

The quantity Vt'' {<J, 7) is the responsibility between state 7 of the HMM jW^-^^ and state <t of the 
HMM J\Aj at time t, when modeling a sequence generated by J\A\ . Similarly, the quantity 
^l'-' {p, a, 7) is the responsibility between a transition from state p to state cr (reached at time t) 
for the HMM A^^*^^ and state 7 (at time i) of the HMM AifK when modeling a sequence generated 
by Aif\ Consequently, the statistic i>['^ (a) is the expected number of times that the HMM A^j'^^ 
starts from state a, when modeling sequences generated by A4f'\ The quantity z>*'-'((t, 7) is the 

(b) (r) 

expected number of times that the HMM Ai] is state 7 when the HMM Ai j is in state a, when 
both modeling sequences generated by Aif''^ . Finally, the quantity (p, a) is the expected number 

(r) 

of transitions from state p to state a of the HMM A4 j , when modeling sequences generated by 
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3.5 M-step 

The M-steps involves maximizing ( |28] l with respect to A^^*"^ for fixed z and 77, i.e., 

= argmaxJ^(X(''\ 2,0,77). (44) 

In the following, we detail the update rules for the parameters of the reduced model MS"^^ . 

3.5.1 HMMs mixture weights 

The re-estimation of the mixture weights, given the constraint X^jLi '^j'^' — 1' is solved using the 
result in Section ( 16.11) : 

'^j - Kib) ■ (45) 

3.5.2 Initial state probabilities 

The cost function ( |28] ) factors independently for each {vr^'^^'"' }^^! (j is fixed) and reduces to terms 
in the form: 

N if'*"' 

J(XW,2,0,77) = ^^i.,,A^.:>i'^(a) log^'^)'^". (46) 

o-=l i=l 



Considering the constraint J2a=i ''^o- — 1, the results in Section |6T| gives the update formulas 

(47) 



3.5.3 State transition probabilities 

Similarly, the cost function (l28T l factors independently for each {a^p}r'''}^^i (j and p are fixed) and 
reduces to terms in the form: 

AT kC') 

Considering the constraint '^p^o^"' = 1' the results in Section l6n gives the update formula 

3.5.4 Eission probability density functions 



In general, in the cost function (|28] l factors independently for each j, p, I, and reduces to terms in 
the form: 

if*''' N M 

E z.,N,Y.-''^P^^) E 4',™ '^^f '^""^ {^ogc'^i'+LaiMf^aM';}')) (50) 

i=l 7—1 m—1 

Using basic matrix calculus, and defining a weighted sum operator 

M 

f^j,p,f(a;(j,/3,m)) = E^^.i^^E'^*"'^^'^) E 4*!" ^' ™) ^^^^ 

z p m—1 
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the parameters A^''"' are updated accordingly to 



17 



f7 



(^<9),(j,p) 



r2 



/'^(i,/3),0-,p)A 



(52) 



(53) 



(54) 



Equations ( |52ll54] l are all weighted averages over all base models, model states, and Gaussian com- 
ponents. 



4 Experiments 

In this section, we present three empirical studies of the VHEM-H3M algorithm. Each application 
exploits some of the benefits of VHEM. First of all, instead of clustering HMMs on the parameter 
manifold, VHEM-H3M clusters HMMs directly through the distributions they represent. Given a 
collection of input HMMs, VHEM estimates a smaller mixture of novel HMMs that consistently 
models the distribution represented by the input HMMs. This is achieved by maximizing the log- 
likelihood of "virtual" samples generated from the input HMMs. As a result, the VHEM cluster 
centers are consistent with the underlying generative probabilistic framework. 

Second, VHEM allows to estimate models from large-scale data sets, by breaking the learning prob- 
lem into smaller pieces. First, a data set is split into small non-overlapping portions and intermediate 
models are learned from each portion. Then, the final model is estimated from the intermediate mod- 
els using the VHEM-H3M algorithm. While based on the same maximum-likelihood principles as 
direct EM estimation on the full data set, this VHEM estimation procedure has significantly lower 
memory requirements, since it is no longer required to store the entire data set during parameter 
estimation. In addition, since the intermediate models are estimated independently of each other, 
this estimation task can easily be parallelized. Lastly, the "virtual" samples (i.e., sequences) VHEM 
implicitly generates for maximum-likelihood estimation need not be of the same length as the actual 
input data for estimating the intermediate models. Making the virtual sequences relatively short will 
positively impact the run time of each VHEM iteration. This may be achieved without loss of mod- 
eling accuracy, as VHEM allows to compensate for shorter virtual training sequences by implicitly 
integrating over a virtually unlimited number of them. 

4.1 Hierarchical motion clustering 




12345678 12345678 



IIIBI mil JI.ILMILJ 

5 10 15 20 25 30 35 40 45 50 55 

Figure 1 : Hierarchical clustering of the MoCap dataset, with VHEM and SC-PPK. 
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In this experiment we tested the VHEM algorithm on hierarchical motion clustering, using the Mo- 
tion Capture dataset (http://mocap.cs.cmu.edu/), which is a collection of time-series data represent- 
ing human motions. In particular, we start from Ki = 56 motion examples form 8 different classes, 
and learn a HMM for each of them, forming the first level of the hierarchy. A tree-structure is formed 
by successively clustering HMMs with the VHEM algorithm, and using the learned cluster centers 
as the representative HMMs at the new level. The second, third and fourth levels of the hierarchy 
correspond to, respectively, K2 — 8, Kj, — 4 and = 2. 

The hierarchical clustering obtained with VHEM is illustrated in Figure [1] (left). In the first level, 
each vertical bar represents a motion sequence, with different colors indicating different ground- 
truth classes. In the second level, the 8 HMM clusters are shown with vertical bars, with the colors 
indicating the proportions of the motion classes in the cluster Almost all clusters are populated by 
examples from a single motion class (e.g., "run", "jog", "jump"), which demonstrates that VHEM 
can group similar motions together. We note an error of the VHEM in clustering a portion of the 
"soccer" examples with "basket". Moving up the hierarchy, the VHEM algorithm clusters similar 
motions classes together (as indicated by the arrows), and at the last (Level 4) it creates a dichotomy 
between the "sit" and the rest of the motion classes. This is a desirable behavior as the a the kinetics 
of the "sit" sequences (i.e., sitting on a stool and going down) are considerably different form the 
rest. On the right of Figure [T] the same experiment is repeated using spectral clustering in tandem 
with PPK similarity (SC-PPK) 1 5 1 . The SC-PPK clusters motions sequences properly, however it in- 
correctly aggregates the "sit" and "soccer", and produces a last level (Level 4) not well interpre table. 

While VHEM has lower Rand-index than SC-PPK at Level 2 (0.940 vs. 0.973), it has higher Rand- 
index at Level 3 (0.775 vs. 0.737) and Level 4 (0.591 vs. 0.568). This suggests that the novel 
HMMs cluster centers learned by VHEM retain more information that the spectral cluster centers. 

annotation retrieval classification 

P R F MAP AROC P@10 VHEM-H3M EM-H3M 



HEM-H3M 0.470 0.210 0.258 0.438 0.700 0.450 r = 5 0.569 0.349 

EM-H3M 0.415 0.214 0.248 0.423 0.704 0.422 r = 10 0.570 0.389 

HEM-DTM 0.430 0.202 0.252 0.439 0.701 0.453 r = 15 0.573 0.343 

Table 1: Annotation and retrieval performance on CAL500, Table 2: Online hand-writing clas- 

for VHEM-H3M, EM-H3M and HEM-DTM ifTsll sification accuracy (20 characters) 



4.2 Automatic music tagging 

In this experiment we evaluated VHEM-H3M on automatic music tagging. We considered the 
CAL500 collection form Barrington et al. |T2l, which consists in 502 songs and provides binary 
annotations with respect to a vocabulary V of 149 tags, ranging from genre and instrumentation, to 
mood and usage. To represent the acoustic content of a song we extract a time series of audio fea- 
tures y = {yi, . . . , ut}, by computing the first 13 Mel frequency cepstral coefficients (MFCC) 1 1) 
over half-overlapping windows of 92ms of audio signal, augmented with first and second derivatives. 

Automatic music tagging is formulated as a supervised multi-class labeling problem ifTTl . where 
each class is a tag from V. We model tags with H3M probability distributions over the space of 
audio fragments (e.g., sequences of r = 125 audio features, which approximately corresponds to 6 
seconds of audio). Each tag model is learned from audio-fragments extracted from relevant songs 
in the database, using the VHEM-H3M. The database is first processed at the song level, using the 
EM algorithm to learn a H3M for each song from a dense sampling of audio fragments. For each 
tag, the song-level H3Ms that are relevant to the tag are pooled together to form a big H3M, and the 
VHEM algorithm is used to learn the final tag-model. 

In table [U we present a comparison of the VHEM-H3M algorithm with the standard EM-H3M algo- 
rithm and a state-of-the-art auto-tagger (HEM-DTM) [13], which uses the dynamic texture mixture 
model and an efficient HEM algorithm, on both annotation an retrieval on the CAL500 dataset. An- 
notation is measured with precision (P), recall (R), f-score (F), and retrieval is measured with mean 
average precision (MAP), area under the operating characteristic curve (AROC), and precision at 
the first 10 retrieved objects (P@10). All reported metrics are averages over the 98 tags that have at 
least 30 examples in CAL500, and are result of 5 fold-cross validation. VHEM-H3M achieves better 
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performance over EM-H3M (except on precision and AROC which are comparable) and strongly 
improves the top of the ranking list, as demonstrated by the higher P@10 score. Performance of 
VHEM-H3M and HEM-DTM are close on all metrics with only slight variations, except on annota- 
tion precision where VHEM-H3M registers a significantly higher score. 



4.3 Online hand-writing recognition 



In this experiment we investigated the performance of the VHEM-H3M algorithm on classification 
of on-line hand-writing. We considered the Character Trajectories Data Set 1 19], which consists in 
2858 examples of characters from the same writer, and used half of the data for training and half for 
testing. An HMM (with N = 2 and AI = 1) was first learned from each of the training sequences 
using the EM algorithm. For each letter, all the relevant HMMs were clustered with the VHEM to 
form a H3M with J^^'"' = 2 components. We repeated the same experiment using the EM-H3M 
algorithm directly on all the relevant sequences in the train data. For each letter, we allowed the 
EM algorithm up to three times the total running time of the VHEM (including the estimation of the 
corresponding intermediate HMMs). Table|2]lists classification accuracy on the test set, for VHEM- 
H3M, using different values of r, and for the corresponding runs of EM-H3M. A small r suffices to 
provide a regular estimate, and simultaneously determines shorter running times for VHEM (under 
2 minutes for all 20 letters). On the other hand, the EM algorithm needs to evaluate the likelihood 
of all the original sequences at each iteration, which determines slower iterations, and prevents the 
EM from converging to effective estimates in the time allowed. 



5 Conclusion 



In this paper, we present a variational HEM (VHEM) algorithm for clustering HMMs through their 
distributions. Moreover, VHEM summarizes each cluster by estimating a new HMM as cluster 
center. We demonstrate the efficacy of this algorithm for various applications, including hierarchical 
motion clustering, semantic music annotation, and online hand-writing recognition. 



6 Appendix on useful optimization problems 



6.1 



The optimization problem 

L 

^ /?£ log ae (55) 

e=i 

L 

.t. ^ = 1 



max 



L 

S.I 

1=1 

ai > 0, V£ 



is optimized by 



— — • (56) 



This can be easily computed with the optimization 

L 



{a|} = argmaxy^ ^£ log + A q<; - 1 
°" 1=1 \i=i j 

where the second term is a Lagrangian term for the weights to sum to 1, and noticing that the 
positivity constraints are automatically satisfied by ( 
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6.2 



The optimization problem 



L 



max 



^ {(3i - log ae) 



(57) 



L 



S.t. 



= 1 



^=1 



> 0, y£ 



is optimized by 



, exp I3i 
= 7^1 



Er=iexp/3; 



(58) 



This can be easilly computed with the optimization 




) 



where the second term is a Lagrangian term for the weights to sum to l,and noticing that the posi- 
tivity constraints are automatically satisfied by (ISST l. 
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